Reverse KL
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target-task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables it to keep prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
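The mixture-of-two-distributions argument can be made concrete with a small numeric sketch (all distributions and parameters below are illustrative assumptions, not the paper's actual setup): fitting a single Gaussian to a bimodal target by minimizing forward KL, the mass-covering direction, lands between the modes, while minimizing reverse KL, the mode-seeking direction associated with on-policy training, locks onto one mode and leaves the other component's region untouched.

```python
import numpy as np

# Discretized bimodal target: one mode for "prior knowledge", one for the
# "target task" (an illustrative toy, not the paper's experimental setup).
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gauss(mu, sigma=1.0):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)   # normalize on the grid

p = 0.5 * gauss(-3.0) + 0.5 * gauss(3.0)   # bimodal mixture

def kl(a, b):
    return float(np.sum(a * np.log(a / b)) * dx)

# Fit a unit-variance Gaussian q_mu by scanning its mean under each objective.
mus = np.linspace(-5, 5, 401)
fwd = [kl(p, gauss(m)) for m in mus]   # forward KL(p || q): mass-covering
rev = [kl(gauss(m), p) for m in mus]   # reverse KL(q || p): mode-seeking

mu_fwd = float(mus[int(np.argmin(fwd))])
mu_rev = float(mus[int(np.argmin(rev))])
print(f"forward-KL optimum: mu = {mu_fwd:+.2f}")  # lands between the modes
print(f"reverse-KL optimum: mu = {mu_rev:+.2f}")  # lands on one mode
```

The forward-KL fit averages over both modes (destroying both), whereas the reverse-KL fit commits to a single mode, mirroring the claim that mode-seeking training can learn the target task while leaving the prior-knowledge mode intact.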
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
- (2 more...)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (2 more...)
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL
Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence, whose mass-covering behavior risks overfitting to imperfect weak signals, with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last layer, reverse KL uniquely guarantees that it outperforms its weak supervisor by the magnitude of their disagreement, a guarantee that forward KL cannot provide. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to consistently outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.
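A minimal sketch of the forward vs. reverse cross-entropy contrast the abstract describes (the label vectors below are made-up toy values, not from the paper): against noisy weak labels, forward cross-entropy favors a strong model that imitates the noise, while reverse cross-entropy, through its zero-forcing direction, favors the confident prediction.

```python
import numpy as np

def forward_ce(weak, strong):
    # standard cross-entropy against the weak labels: -sum weak * log(strong)
    return float(-np.sum(weak * np.log(strong)))

def reverse_ce(weak, strong):
    # reverse cross-entropy: -sum strong * log(weak), the zero-forcing direction
    return float(-np.sum(strong * np.log(weak)))

weak_noisy = np.array([0.55, 0.45])   # unreliable, low-confidence weak labels
copy_weak = weak_noisy                # strong model that imitates the weak signal
confident = np.array([0.95, 0.05])    # strong model that commits to one class

# Forward CE is minimized by imitating the noisy weak labels;
# reverse CE instead prefers the high-confidence prediction.
print("forward CE:", forward_ce(weak_noisy, copy_weak), "vs", forward_ce(weak_noisy, confident))
print("reverse CE:", reverse_ce(weak_noisy, confident), "vs", reverse_ce(weak_noisy, copy_weak))
```

Swapping which distribution sits inside the logarithm is the whole difference: under forward CE the strong model is pulled toward the 55/45 noise, while under reverse CE the confident prediction scores strictly better than copying the weak supervisor.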
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- Asia > Middle East > Jordan (0.04)
- Asia > China (0.04)
Flow-based sampling for multimodal and extended-mode distributions in lattice field theory
Daniel C. Hackett, Chung-Chun Hsieh, Sahil Pontula, Michael S. Albergo, Denis Boyda, Jiunn-Wei Chen, Kai-Feng Chen, Kyle Cranmer, Gurtej Kanwar, Phiala E. Shanahan
Recent results have demonstrated that samplers constructed with flow-based generative models are a promising new approach for configuration generation in lattice field theory. In this paper, we present a set of training- and architecture-based methods to construct flow models for targets with multiple separated modes (i.e., vacua) as well as targets with extended/continuous modes. We demonstrate the application of these methods to modeling two-dimensional real and complex scalar field theories in their symmetry-broken phases. In this context we investigate different flow-based sampling algorithms, including a composite sampling algorithm where flow-based proposals are occasionally augmented by applying updates using traditional algorithms like HMC.
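A toy sketch of the flow-based sampling step described here, with a fixed broad Gaussian standing in for a trained flow (the target, proposal, and all parameters are illustrative assumptions): an independence Metropolis sampler whose accept/reject test corrects for the proposal density, applied to a 1D target with two separated modes playing the role of vacua.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    # unnormalized bimodal target: equal-weight Gaussians at the two "vacua"
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

prop_sigma = 3.0   # broad proposal covering both modes (stand-in for a flow)
def log_q(x):
    return -0.5 * (x / prop_sigma) ** 2

x, chain, accepted = 0.0, [], 0
for _ in range(20000):
    x_new = prop_sigma * rng.standard_normal()
    # independence-Metropolis ratio, correcting for the proposal density
    log_alpha = (log_p(x_new) - log_p(x)) + (log_q(x) - log_q(x_new))
    if np.log(rng.random()) < log_alpha:
        x, accepted = x_new, accepted + 1
    chain.append(x)

chain = np.array(chain)
acc_rate = accepted / len(chain)
print(f"acceptance rate: {acc_rate:.2f}")
print(f"fraction of samples in the x > 0 mode: {(chain > 0).mean():.2f}")  # ~0.5
```

Because the proposal is drawn independently of the current state, the chain can jump directly between the separated modes, which is exactly the property that makes flow proposals attractive next to local updaters like HMC in a composite sampler.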
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- (6 more...)
- Energy (0.67)
- Government > Regional Government (0.45)
Reviews: Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness
Detecting inputs that are outside the distribution of training examples, including adversarial inputs, is an important problem; reviewers and the area chair agree that this paper makes a useful algorithmic contribution toward solving it. The argument that reverse KL is conceptually correct, while forward KL as used previously is conceptually wrong, is significant. Training with reverse KL is a simple and compelling idea that practitioners can try easily. For these reasons the paper is being accepted so that the community can benefit from it quickly, even though reviewers have identified ways in which both the writing and the empirical evaluation need improvement. The authors are encouraged to improve the final version.
Mixed Noise and Posterior Estimation with Conditional DeepGEM
Paul Hagemann, Johannes Hertrich, Maren Casfor, Sebastian Heidenreich, Gabriele Steidl
In numerous healthcare and other contemporary applications, the variables of primary interest are obtained through indirect measurements, such as in the case of Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). For some of these applications, the reliability of the results is of particular importance. The accuracy and trustworthiness of the outcomes obtained through indirect measurements are significantly influenced by two critical factors: the degree of uncertainty associated with the measuring instrument and the appropriateness of the (forward) model used for the reconstruction of the parameters of interest (measurand). In this paper, we consider Bayesian inversion to obtain the measurand from signals measured by the instrument and a noise model that mimics both the instrument noise and the error of the forward model.
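For intuition, the Bayesian inversion setting can be sketched in its simplest linear-Gaussian form (the forward model, covariances, and measurement below are made-up values, not the paper's DeepGEM method): the posterior over the measurand combines the prior with a likelihood whose noise covariance absorbs both instrument noise and forward-model error.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.0, 1.0]])                 # assumed linear forward model
Sigma_noise = 0.1 * np.eye(2)              # instrument noise + model-error covariance
Sigma_prior = np.eye(2)                    # prior covariance of the measurand
mu_prior = np.zeros(2)
y = np.array([1.2, 0.4])                   # measured signal

# Gaussian posterior: precision-weighted combination of prior and likelihood.
prec_post = np.linalg.inv(Sigma_prior) + A.T @ np.linalg.inv(Sigma_noise) @ A
P_post = np.linalg.inv(prec_post)          # posterior covariance
mu_post = P_post @ (np.linalg.inv(Sigma_prior) @ mu_prior
                    + A.T @ np.linalg.inv(Sigma_noise) @ y)

# With low noise, the posterior mean sits near the noiseless inversion
# A^-1 y = [1.0, 0.4], pulled slightly toward the prior mean.
print("posterior mean:", mu_post)
print("posterior covariance:\n", P_post)
```

Inflating `Sigma_noise` to reflect forward-model error (rather than instrument noise alone) widens the posterior accordingly, which is the kind of honest uncertainty accounting the abstract emphasizes.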
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany > Berlin (0.04)
Sample as You Infer: Predictive Coding With Langevin Dynamics
Umais Zahid, Qinghai Guo, Zafeirios Fountas
It is well known that neuronal systems, including their dynamics and responses, are rife with noise at multiple levels (Faisal et al., 2008; Shadlen & Newsome, 1998). These sources of noise arise from, amongst other things, stochastic processes occurring at the sub-cellular level, impacting neuronal responses through, for example, fluctuations in membrane potential (Derksen & Verveen, 1966). Yet the precise role of such randomness in information processing continues to be an open question (McDonnell & Ward, 2011; Deco et al., 2013). The Langevin PC algorithm suggests one such role may be the principled exploration of the latent space of hypotheses under one's generative model. Secondly, from the perspective of Langevin PC as an in-silico generative modelling algorithm, we note a number of interesting avenues that we have not had the time to explore here. These include:
- Models with a hierarchy of stochastic variables, such as those found in most state-of-the-art VAE models (Child, 2021; Vahdat & Kautz, 2021; Hazami et al., 2022), which may require adopting a corresponding top-down hierarchical warm-start model.
- Automatic convergence criteria for determining when our Markov chain has converged to a certain level of error (Roy, 2020).
- Underdamped Langevin dynamics, which incorporate auxiliary momentum variables into the Langevin sampling to achieve an accelerated rate of convergence (Cheng et al., 2018; Ma et al., 2019).
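A minimal sketch of Langevin sampling over a latent variable, in the spirit of Langevin PC (the one-dimensional Gaussian model and step size are illustrative assumptions, not the paper's architecture): unadjusted Langevin dynamics follow the log-posterior gradient while injected noise plays exactly the exploratory role discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy generative model: x = z + noise, with a standard Gaussian prior on z.
x_obs, sigma_lik, sigma_prior = 2.0, 1.0, 1.0

def grad_log_post(z):
    # d/dz [ log N(x_obs; z, sigma_lik^2) + log N(z; 0, sigma_prior^2) ]
    return (x_obs - z) / sigma_lik**2 - z / sigma_prior**2

eps, z, samples = 0.05, 0.0, []
for t in range(50000):
    # unadjusted Langevin step: gradient drift plus injected Gaussian noise
    z = z + 0.5 * eps * grad_log_post(z) + np.sqrt(eps) * rng.standard_normal()
    if t > 5000:                      # discard burn-in
        samples.append(z)

# The exact posterior here is N(1.0, 0.5); the chain should settle close to it.
print(f"sample mean: {np.mean(samples):.2f}, sample variance: {np.var(samples):.2f}")
```

Dropping the noise term recovers a plain gradient ascent to the posterior mode, i.e. standard predictive-coding inference; the added noise is what turns mode-finding into sampling over the latent space of hypotheses.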
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- North America > United States > New York (0.04)
- (2 more...)
- Instructional Material (0.46)
- Research Report (0.43)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)